ABSTRACT

In the era of Big Data, omic-scale technologies, and 
increasing calls for data sharing, it is generally agreed 
that the use of community-developed, open data 
standards is critical. Far less agreed upon is exactly 
which data standards should be used, the criteria by 
which one should choose a standard, or even what 
constitutes a data standard. Choosing a domain does
not by itself determine which data standards should
be used in all cases. The 'right' standard to use
often depends on the use case scenarios for a given
project. Potential downstream
applications for the data, however, may not always be 
apparent at the time the data are generated. Similarly, 
technology evolves, adding further complexity. Would-be 
standards adopters must strike a balance between 
planning for the future and minimizing the burden 
of compliance. Better tools and resources are required 
to help guide this balancing act. 
BACKGROUND

Members of the scientific community are increasingly
expected to share data, and to do so in a
standards-compliant manner. This is evidenced by
the recent mandates, announcements, and requests
for information by the funding agencies and
journals, and numerous essays and announcements
by the scientific community, including pre-competitive
initiatives by the life science industry.
However, the scientific community is not necessarily
well poised to comply. All stakeholders (funders,
journal editors, researchers, and those supporting
them) struggle to navigate the existing standards and
make informed decisions. As an example, in 2009
one of our groups aimed to create a standards-compliant,
integrated data repository for clinical and
'omics' data, among other types. This raised the
question: with which standards should we comply?
Through subsequent efforts to answer this question,
three key points have become clear:

1. Different groups and individuals have different
definitions for what constitutes a 'data standard'.

2. Even within one domain, no one standard is the
'right' standard across all cases; rather, one must
select a standard (or even specific pieces of a
standard) based on one's particular needs.

3. Integrated resources and registries are needed to
help researchers navigate the fluid standards landscape
and to choose and implement the right
standard for their respective project.

The focus for that project was on omics data
standards, but these points apply across the spectrum
of biomedical data types. High-dimensional 'Big
Data' equate to large numbers of parameters, which
in turn require yet more data for sufficient statistical
power. Importantly, this massive amount of data
lends itself to many different analytic approaches,
putting comprehensive analysis beyond the capabilities
of any one researcher. The size and complexity
of these data, combined with growing scarcity of
research funding and the quest for personalized
medicine, make it increasingly important to maximize
the utility of research dollars through data sharing and
re-use. Efforts to this end are demonstrated by a spate
of new data sharing and aggregation initiatives by
academics, private-public partnerships, and publishers,
for example Sage Bionetworks, the Pistoia Alliance
(http://www.pistoiaalliance.org) and DRYAD, among
others. At the national level in the USA, the data
sharing trend is reflected in programs such as the
National Institutes of Health's (NIH) recently
announced 'Big Data to Knowledge' (BD2K) initiative,
and the White House Office of Science and
Technology Policy's recent directive that the results of
government-funded research be made publicly available.
The Innovative Medicines Initiative
(http://www.imi.europa.eu/) is Europe's largest public-private
initiative that supports collaborative research projects
and builds networks of industrial and academic
experts in order to boost pharmaceutical innovation in
Europe. Internationally, the Research Data Alliance
(https://rd-alliance.org/) has been established by an
international steering group from funding agencies in
the USA, EU and Australia; and recently the global
alliance for genomic and clinical data sharing has
brought together over 70 leading healthcare, research,
and disease advocacy organizations, involving researchers
from more than 40 countries, to enable secure
sharing of genomic and clinical data.

These types of initiatives, together with the
evolving portfolio of grass-roots standards, have
enhanced the need to maximize awareness and discoverability
of standards. Such efforts are becoming
more common, but they lack integration or unification.
There is a clear need for some level of
coordination, without taking the form of a top-down
authority. How can we avoid requiring would-be
standard adopters to spend considerable time and
effort becoming well versed in a multitude of standards
solely in order to rule most of them out?

WHAT IS A DATA STANDARD? 

The International Organization for Standardization 
defines a standard as '...a document that provides 
requirements, specifications, guidelines or charac- 
teristics that can be used consistently to ensure that 
materials, products, processes and services are fit 
for their purpose'. Standards range from de jure, 
that is, ordained by some official organization such as 
the International Organization for Standardization or 
the American National Standards Institute, to de 
facto, that is, developed by grass-root initiatives and 
commonly adopted, but not prescribed by an official 
or specific authority. The BioSharing registry
(http://biosharing.org/) houses a fairly comprehensive,
curated list of data standards (primarily de facto) in the life
science, environmental, and biomedical space. These standards are
divided into three categories. First, content standards take the form
of reporting guidelines, for example, minimum information checklists.
These vary from general guidance to itemized prescriptions of
the information that should be provided (ie, curation guidelines),
including both data and metadata. The second category consists of
syntax standards in the form of representations and formats that
facilitate the exchange of information. These fall broadly into two
types: delimited text, or a 'markup language' such as XML. Third
are the semantic standards in the form of terminology artifacts,
such as controlled vocabularies or ontologies. These add an interpretive
layer to the data by defining the concepts or terms in a
domain, and in some cases the relationships between them.

Tenenbaum JD, et al. J Am Med Inform Assoc 2014;21:200-203. doi:10.1136/amiajnl-2013-002066

Other discussions of standards include the notion of a data 
model, which extends beyond terms and their definitions to 
describe the relationships between concepts in a domain. 
Other groups also use additional terms such as conceptual 
model, conceptual schema, ontology, or domain analysis 
model, but generally differ on what each of these terms
means. This is in fact part of the confusion: even data standard
experts do not agree on what constitutes a data standard.
Nevertheless, focusing just within the context of transcriptomics,
preliminary investigation yielded a list of 15 potentially
relevant standards (table 1). Note that this list could grow
depending on the type of sample and organism used, as many 
terminologies are species specific. Now imagine if a researcher 
has an associated dataset from a proteomics investigation, for 
example. How is a mere mortal to sort through these? 

FIT FOR PURPOSE 

In biomarker discovery, the phrase 'fit-for-purpose' refers to the 
notion that the degree of rigor for assay validation should be 
tailored to the intended purpose of a given biomarker study.
The same is true for data standards adoption. While each
individual project will inevitably have its own specific requirements,
it can be useful to group projects across a spectrum of
rigor. At the lowest level, there is the use case of data sharing 
within a laboratory or between collaborators. While minimum 
information guidelines should be followed, for the most part 
any documentation need only be human readable, and issues 
requiring clarification are merely a walk down the hall or an 
e-mail away (at least until the student graduates or the postdoc 
moves on). Data that are to be shared publicly, for example, 
accompanying a publication, require more rigor. Ideally, a pro- 
spective consumer of the data can both understand and repro- 
duce those data without needing to contact the original author. 
Furthermore, much of the content of publications is now aggregated
and curated by various online resources. These value-added
services can be much more efficient and effective at
making content available via secondary sources when quality 
data standards are used. Minimally structured data can be very 
helpful for such purposes; for example, the use of a unique 
identifier to describe a molecule or a standardized vocabulary 
term to denote the disease area under study. The highest level of 
rigor is needed for contribution of data to a structured data 
repository. In this case, additional effort is warranted in the 
form of structured fields and a standardized, machine-readable 
format. Such rigor enables querying across multiple datasets and 
integrative meta-analysis combining more than one set. 

One key point in differentiating between these levels of rigor 
is that there are different 'flavors' of annotation. At every level, 
there is a difference between what needs to be documented, and 
what needs to be documented in a structured and queryable 
fashion. While the option exists to select a standard that allows 
for maximum structure and adopt it only loosely, complexity 
can turn off would-be standards adopters, as well as waste time 
in development if such rigor will ultimately never be needed. 

Table 1 A sampling of (some of the) standards related to microarray-based transcriptomics, generated by non-experts for evaluation of
relevance to a project involving microarray-based transcriptomics data

Standard | Type | Description
MIAME | Reporting guideline | Minimum Information About a Microarray Experiment. Specifies six components that must be included to describe a microarray experiment, for example, raw and processed data, experimental design, sample annotation, protocols. MIAME does not specify how these components must be represented, for example, in any given format, or using any given terminology
ISA-TAB | Exchange format | Generic format for experimental representations; conversion tools to MAGE-TAB, MINiML and other formats exist
MAGE-TAB | Exchange format | MicroArray and Gene Expression-Tabular. Simple tab-delimited, spreadsheet-based format. Used by ArrayExpress
MAGE-ML | Exchange format | MicroArray and Gene Expression-Markup Language. No longer supported
SOFT | Exchange format | Simple Omnibus Format in Text. Line-based, plain text format designed for rapid batch submission of data. Used by GEO
MINiML | Exchange format | MIAME Notation in Markup Language. Optimized for microarray and other high-throughput molecular abundance data. Used by GEO
GO | Terminology artifact | Gene Ontology. Controlled vocabulary for annotation of gene function and cellular location. Part of the OBO Foundry
EFO | Terminology artifact | Experimental Factor Ontology. Provides a systematic description of many experimental variables. Used by ArrayExpress
OBI | Terminology artifact | Broader scope for experimental representations. Part of the OBO Foundry
MGED Ontology | Terminology artifact | Integrated in OBI
MAGE-OM | Object model | MicroArray and Gene Expression Object Model. The object model from which MAGE-ML was derived
FuGE | Object model | Generic object model for functional genomics
SEND | Exchange format | Standard for Exchange of Nonclinical Data: an implementation of the CDISC (Clinical Data Interchange Standards Consortium) SDTM (Study Data Tabulation Model)
GEML | Exchange format | Deprecated and/or replaced (see note)
FUGO | Terminology artifact | Deprecated and/or replaced (see note)
MAML | Exchange format | Deprecated and/or replaced (see note)

Note: GEML, FUGO, and MAML have since been deprecated and/or replaced by other standards, but that progression may not always be clear to novice users.

Categories of criteria to be used in evaluating data standards
for adoption include:

► The standard itself

- specification documentation 

- ease of implementation (eg, level of documentation, 
requirement for programmer support) 

- human and machine readability 

- formal structure 

- expressivity — the breadth of information that can be 
represented 

- ease of use, for example, minimal required fields, a text-based
interface familiar to biologists.

► Adoption and user community 

- broad adoption and implementation, outside the initial group 

- support supplied by the user community

- use by community databases 

- software development that supports the standard (eg, for 
curating, submitting to databases) 

- responsiveness to community requests 

- availability of examples of use 

- requirements of relevant authoritative bodies, for example,
funders (NIH, National Science Foundation, Centers for
Medicare & Medicaid Services), publishers, etc.

► Additional factors 

- integration/compatibility with other standards

- extensibility and flexibility to cover new domains

- conversion and mapping, when applicable

- cost (eg, open vs licensing fee).

Of course, specific projects may have additional criteria to add,
and different projects will place different weight on the different
items. Unfortunately, standards adoption, when it happens, is
often determined less by an objective criteria-based evaluation
and more by historical precedent ('my advisor used standard
X'), marketing ('I saw a press-release about standard X') or
sociopolitical circumstance ('I know someone on the standard X
team'). What makes it even more difficult to select standards
empirically, based on objective criteria, is that standards are often
complex. Even well-documented standards can be dense and
impenetrable to prospective users who were not involved in their
development. This is one reason why standards are often duplicated
or reinvented. Other factors include the desire for some
level of control, or recognition for doing the work.
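
As an illustration of how such criteria might be weighed explicitly rather than by precedent, the following minimal Python sketch scores candidate standards against project-specific weights. The criterion names, the 0-5 rating scale, and the candidates 'standard_x' and 'standard_y' are hypothetical placeholders chosen for illustration; they are not assessments of any real standard.

```python
# Illustrative sketch only: criterion names, ratings, and candidate names
# are hypothetical and do not describe any real data standard.

def weighted_score(ratings, weights):
    """Return the weight-normalized mean of per-criterion ratings."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(ratings)
    if missing:
        raise KeyError(f"no rating for criteria: {sorted(missing)}")
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_standards(candidates, weights):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(name, weighted_score(r, weights)) for name, r in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder ratings on an assumed 0-5 scale.
CANDIDATES = {
    "standard_x": {"adoption": 4, "cost": 1},
    "standard_y": {"adoption": 1, "cost": 5},
}
```

With these placeholder ratings, `rank_standards(CANDIDATES, {"adoption": 2, "cost": 1})` ranks standard_x first, whereas reversing the two weights ranks standard_y first. The same candidates can therefore be ordered differently depending on the weight a project places on each item.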

RESOURCES WANTED 

The recent data and informatics working group report to the
advisory committee to the director of the NIH included recommendations
to establish a minimal metadata framework for data
sharing, and to create catalogs and tools to facilitate data
sharing. A truly minimal set of metadata elements is important
if we are to have any hope of compliance because the activation
energy required for data curation and annotation represents a significant
hurdle in facilitating data sharing. The minimum information
for biological and biomedical investigations (MIBBI)
project, part of the broader BioSharing effort, worked with different
research communities to coordinate their 'minimum information'
checklists, but each community has some unique
requirements. Also, data annotation presents an inherent tension:
the easier we make it for investigators to annotate their datasets,
the harder it will be to ensure discoverability. Conversely, the
more discoverable we make the datasets, for example, through
annotation using controlled terminologies, the more burden we
put on the data generators.
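
To make the idea of a minimal checklist concrete, the following Python sketch reports which required elements are absent from a dataset's metadata record. The field names are hypothetical and only loosely modeled on the components of a MIAME-style reporting guideline; a real checklist would define its own elements and acceptable values.

```python
# Illustrative sketch only: these field names are assumptions made for
# this example and are not taken from any official checklist schema.
REQUIRED_FIELDS = (
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
)


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in the record."""
    return [f for f in required if not str(record.get(f) or "").strip()]


def is_compliant(record, required=REQUIRED_FIELDS):
    """A record is treated as compliant when no required field is missing."""
    return not missing_fields(record, required)
```

A check of this kind only confirms that something was entered for each element. It says nothing about whether the entry uses a controlled terminology, which is where the additional burden on data generators arises.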

Researchers need better tools and resources to identify, evalu- 
ate, and implement standards. BioSharing is a great resource to 
register and discover standards, and has adopted the initial set 
of criteria described above, requiring the communities to do a 
self-appraisal and tag their entries accordingly. The standards 
development community also has an active role to play if they
wish to maximize the use and uptake of their work. Reviewers
of publications and associated adherence to data standards
should include biocurators. In the absence of widely agreed
upon metrics to evaluate community standards, the decision
about which is the right standard falls on the researcher. For
reasons described above, this situation is problematic. Table 2 lists
some potential resources/functionalities to address this problem. 
For any of these resources, it is important to note that technology 
is dynamic, and therefore so are any associated standards. 
Relevant resources must be similarly dynamic and up to date. 

DISCUSSION 

While one can conjure up motivating scenarios from a regula- 
tory or archiving standpoint, the value proposition behind 
adherence to standards only really makes sense if data are to be 
shared beyond the team that originally created them. Thanks in
part to policies put in place by some funders and publishers,
many high throughput datasets are made publicly available and,
at some level, standards compliant. However, these policies
have a number of restrictions that make them fall short. Some
apply only to data generation through grants that exceed
US$500 000. Some require only a very low bar of compliance,
and data are still difficult if not impossible to interpret. In many
cases, the policies are simply not enforced, although the government
and the NIH have recently taken steps to rectify that
fact.



Table 2 Potential resources to assist in the selection and adoption of appropriate standards

Resource | Notes
Lay person's primer to standards | This would be a text document for the lay person to describe the standard, what problem it helps solve, and how it achieves that. Although FAQs address a number of these questions, one must first identify the standard and find the respective FAQ. This would be a centralized collection of documentation that requires no previous knowledge
'Consumer reviews' | This would be a rating system along the lines of Amazon product reviews. Ontology registries such as the NCBO and the OBO Foundry enable or perform reviews, but the reviews are few in number, not substantive, or infrequent. As discussed above, the utility of a standard depends on the purpose for which it is being used, so information beyond numeric scores is needed
Standard-selection wizard | Decision support methods could be used to ask a researcher about the intended goals and make recommendations accordingly. For example, 'what instrument type was used to generate the data?' and, 'will these data be deposited in a public data repository? If so, which one?' etc. Clearly this would require significant resources and ongoing maintenance
Standards-adoption 'helpdesk' | This would be a centralized resource of real humans with expertise across a number of standards. Once a standard has been selected, many have rich user communities and distribution lists for help with questions. However, for an individual investigator who wants to be standards-compliant and does not know where to begin, expert advice can save significant time in researching options
Quality assurance tools | Similar to syntax validators such as for RDF, tools to gauge or validate standards compliance are useful for data submitters as well as reviewers

NCBO, National Center for Biomedical Ontology (http://www.bioontology.org/); RDF, Resource Description Framework.
Ideally, it should be noted, researchers themselves would be
shielded from the complexity of data standards. Developers,
informaticists, and curators are perhaps better equipped to delve
into data standards than would be a clinician or bench scientist,
but even they are typically not experts in specialized standards. In
an ideal world, data generators would have access to user-friendly
tools that enable the seamless use of relevant standards and can
be customized to fit the different data and domain needs. The
actual standards would be hidden from the data generators, and
their use made automatic through intuitive, user-friendly tools.

Although we have described some tools for the discovery and 
evaluation of standards if one is so inclined, the real challenge is 
incentivizing researchers to go to the trouble. This will probably
need a combination of proverbial carrots and sticks. On the
penalty side, funders and publishers must continue to develop and
publicize progressive data-sharing policies, and to enforce those
policies through the delay of publication or future funding, if
necessary. On the incentives side, a formal system for data citation
must be developed, and those citations acknowledged and valued
by funders, professional organizations, and university promotion
and tenure committees. Recent activity in the realm of data publishing
has been an important first step. Only when obstacles
are minimized and incentives are properly aligned will investigators
be able to justify the effort required to do the right thing.

Correction notice This article has been corrected since it was published Online 
First. In the Discussion section 'US$500 million' has been changed to 'US$500 000.' 